Add analysis scripts to flores200 dataset #705
Open
klei22 wants to merge 10 commits into ReaLLMASIC:master
added 10 commits
December 23, 2025 18:49
Each script now accepts a common stats_json argument, which emits the number of tokens not transcribed (those that will be held for byte tokenization). Added an espeak2ipa.py script that can target any of the espeak languages; it defaults to Shan for now.
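As a rough illustration of the shared stats_json argument described above, the sketch below counts transcribed versus non-transcribed tokens and bytes and optionally writes them to a JSON file. The flag spelling `--stats_json`, the `transcribe_token` stub, and the JSON field names are assumptions made for this example; they are not taken from the PR's code.

```python
import argparse
import json
from typing import Callable, Dict, List, Optional


def transcribe_token(token: str) -> Optional[str]:
    """Hypothetical stand-in for an IPA backend: only ASCII letters 'transcribe'."""
    if token.isascii() and token.isalpha():
        return f"/{token.lower()}/"
    return None  # caller leaves this token for byte tokenization


def collect_stats(text: str, transcribe: Callable[[str], Optional[str]]) -> Dict[str, int]:
    """Count whitespace-separated tokens and UTF-8 bytes by transcription outcome."""
    stats = {
        "transcribed_tokens": 0,
        "not_transcribed_tokens": 0,
        "transcribed_bytes": 0,
        "not_transcribed_bytes": 0,
    }
    for token in text.split():
        n_bytes = len(token.encode("utf-8"))
        if transcribe(token) is not None:
            stats["transcribed_tokens"] += 1
            stats["transcribed_bytes"] += n_bytes
        else:
            stats["not_transcribed_tokens"] += 1
            stats["not_transcribed_bytes"] += n_bytes
    return stats


def main(argv: Optional[List[str]] = None) -> None:
    parser = argparse.ArgumentParser(description="Sketch of a shared stats_json option")
    parser.add_argument("input_file")
    # Assumed flag name; the PR only says the scripts share a stats_json argument.
    parser.add_argument("--stats_json", default=None,
                        help="Optional path for the coverage statistics")
    args = parser.parse_args(argv)

    with open(args.input_file, encoding="utf-8") as f:
        stats = collect_stats(f.read(), transcribe_token)

    if args.stats_json:
        with open(args.stats_json, "w", encoding="utf-8") as f:
            json.dump(stats, f, indent=2)
```

Calling `main(["input.txt", "--stats_json", "stats.json"])` would write the four counters to `stats.json`; both file names here are placeholders.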
Pull request overview
This pull request adds a comprehensive suite of analysis, visualization, and processing scripts for the Flores-200 restructured dataset. The changes introduce utilities for language-script analysis, IPA phoneticization with byte coverage statistics, tokenization comparison, and interactive visualizations.
Key changes:
- Added byte coverage statistics tracking to all IPA transcription scripts (Chinese, Korean, Japanese, English, and generic espeak-based)
- Created new plotting utilities for visualizing dataset sizes grouped by script/region/family and comparing tokenization methods
- Introduced shell scripts to automate graph generation and IPA processing workflows
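To make the "grouped by script" idea in the list above concrete, here is a minimal sketch that sums sizes per script, assuming entries are keyed by Flores-200 style codes such as `eng_Latn`. The function name and the sample sizes are made up for illustration and do not come from the PR or the dataset.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def group_sizes_by_script(entries: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Sum sizes per script, reading the script from codes like 'eng_Latn'."""
    totals: Dict[str, int] = defaultdict(int)
    for code, size in entries:
        _lang, _sep, script = code.partition("_")
        totals[script or "unknown"] += size
    return dict(totals)


# Made-up sizes, for illustration only (not real dataset measurements).
sample = [("eng_Latn", 120), ("fra_Latn", 130), ("zho_Hans", 90), ("kor_Hang", 100)]
totals = group_sizes_by_script(sample)
for script, size in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{script}: {size}")
```

A plotting script could feed totals like these into a grouped bar chart; mapping scripts to regions or families would need an extra lookup table, which is omitted here.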
Reviewed changes
Copilot reviewed 25 out of 25 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `data/template/utils/zh_to_ipa.py` | Enhanced with byte coverage stats tracking for transcribed vs non-transcribed content |
| `data/template/utils/ko_en_to_ipa.py` | Added stats tracking and helper functions for Korean IPA transcription |
| `data/template/utils/ja2ipa.py` | Integrated byte coverage statistics and simplified conditional logic |
| `data/template/utils/en2ipa.py` | Added stats tracking with thread-safe accumulation for English transcription |
| `data/template/utils/espeak2ipa.py` | New generic IPA transcription tool supporting any espeak-ng voice with multithreading |
| `data/flores200-res/plot_langscript_sizes_grouped.py` | Visualizes language-script sizes grouped by region/script/family |
| `data/flores200-res/plot_multi_script_languages.py` | Plots languages appearing in multiple scripts with fixed color mapping |
| `data/flores200-res/plot_tokenization_vs_original.py` | Compares tokenized vs original text sizes across methods |
| `data/flores200-res/plot_ipa_vs_text.py` | Analyzes IPA vs raw text sizes with optional tokenization comparison |
| `data/flores200-res/tokenize_and_annotate_sizes.py` | Tokenizes files and annotates JSON with tokenized sizes |
| `data/flores200-res/spm_vocab_freq_dashboard.py` | Interactive HTML dashboard for SentencePiece vocabulary analysis |
| `data/flores200-res/filter_files_by_script.py` | Extracts script/language fields from files.json |
| `data/flores200-res/phoneticize.sh` | Updated to process multiple languages with stats output |
| `data/flores200-res/graphs.sh` | Automates generation of various dataset visualizations |
| `data/flores200-res/ipa_scripts.sh` | Shell script for IPA vs text comparison workflows |
| `data/flores200-res/tokenize.sh` | Wrapper for tokenization with tiktoken |
| `data/flores200-res/tokenization_vs_origina.sh` | Generates tokenization ratio plots (filename has typo) |
| `data/flores200-res/get_dataset.sh` | Expanded language array with additional languages |
| `data/flores200-res/*.json` | Added stats files and filtered dataset entries |
| `data/flores200-res/README.md` | Documentation for scripts, license, and language code references |
| `data/flores200-res/.gitignore` | Excludes generated PNG files |
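The table mentions "thread-safe accumulation" of stats for the multithreaded transcription scripts. One common way to do that is a lock-protected accumulator shared by the worker threads, sketched below. The class name, the field names, and the ASCII-based `process_line` worker are hypothetical; the PR's actual implementation may differ.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


class StatsAccumulator:
    """Collects byte counts from many threads behind a single lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._totals: Dict[str, int] = {"transcribed_bytes": 0, "not_transcribed_bytes": 0}

    def add(self, transcribed: int, not_transcribed: int) -> None:
        # The lock makes the two read-modify-write updates atomic as a pair.
        with self._lock:
            self._totals["transcribed_bytes"] += transcribed
            self._totals["not_transcribed_bytes"] += not_transcribed

    def snapshot(self) -> Dict[str, int]:
        with self._lock:
            return dict(self._totals)


def process_line(line: str, acc: StatsAccumulator) -> None:
    """Hypothetical worker: ASCII words count as transcribed, all others do not."""
    words = line.split()
    ok = sum(len(w.encode("utf-8")) for w in words if w.isascii())
    missed = sum(len(w.encode("utf-8")) for w in words if not w.isascii())
    acc.add(ok, missed)


lines: List[str] = ["hello world", "hello 世界", "abc"] * 100
acc = StatsAccumulator()
with ThreadPoolExecutor(max_workers=4) as pool:
    # list() forces all tasks to finish and surfaces any worker exception.
    list(pool.map(lambda ln: process_line(ln, acc), lines))
result = acc.snapshot()
print(result)
```

Each worker computes its counts locally and takes the lock only for the final addition, which keeps contention low.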
    import argparse
    import re
    import json
    from typing import List, Optional, Dict, Any, Tuple

Import of 'Tuple' is not used.

Suggested change

    - from typing import List, Optional, Dict, Any, Tuple
    + from typing import List, Optional, Dict, Any

    import argparse
    import json
    import os

Import of 'os' is not used.

Suggested change

    - import os
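Findings like the two above can be approximated locally with the standard-library `ast` module. The sketch below is a simplified check and not how the review tool works: it misses names referenced only inside string annotations or `__all__`, and the helper name `unused_imports` is made up for this example.

```python
import ast
from typing import Dict, List


def unused_imports(source: str) -> List[str]:
    """Return imported names that are never referenced as a plain name in `source`."""
    tree = ast.parse(source)
    imported: Dict[str, int] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # `import a.b` binds the top-level name `a`.
                imported[(alias.asname or alias.name).split(".")[0]] = node.lineno
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported[alias.asname or alias.name] = node.lineno
    # Attribute access such as `os.path` still contains a Name node for `os`.
    used = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    return sorted(name for name in imported if name not in used)


example = (
    "import argparse\n"
    "import json\n"
    "import os\n"
    "parser = argparse.ArgumentParser()\n"
    "print(json.dumps({}))\n"
)
print(unused_imports(example))  # ['os']
```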
This pull request introduces several new scripts and updates for working with the Flores-200 restructured dataset, focusing on language-script analysis, phoneticization, and visualization. It adds new Python utilities for filtering and plotting dataset statistics, updates shell scripts for processing and analyzing data, and includes new documentation and stats files.
New analysis and plotting scripts:
- Added `plot_langscript_sizes_grouped.py`, a Python script for visualizing language-script dataset sizes grouped and colored by script, region, or family. It includes logic for mapping scripts to regions and families and generates grouped bar plots.
- Added `filter_files_by_script.py`, a Python utility to extract relevant fields from `files.json` for script/language analysis, outputting a simplified JSON for downstream analysis.
Shell script improvements and additions:
- Updated `phoneticize.sh` to process multiple languages using `espeak2ipa.py`, with improved logic for handling multiple files and stats output.
- Added `graphs.sh` and `ipa_scripts.sh` shell scripts to automate generation of visualizations and IPA/text comparisons using the new Python plotting utilities.
Documentation and dataset updates:
- Added `README.md` describing the scripts, dataset license, and language code references for the restructured Flores-200 dataset.
- Added stats files (`eng_stats.json`, `ja_stats.json`, `ko_stats.json`) reporting transcription statistics for English, Japanese, and Korean data.
Miscellaneous:
- `.gitignore` now excludes PNG files generated by plotting scripts.
These changes provide a more robust framework for analyzing, processing, and visualizing the Flores-200 restructured dataset, making it easier to work with language-script data and phonetic transcriptions.